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(— I Abstract. The FIND algorithm is a fast algorithm designed to calculate certain entries of the inverse of 

a sparse matrix. Such calculation is critical in many applications, e.g., quantum transport in nano-devices. 
We extended the algorithm to other matrix inverse related calculations. Those are required for example 
to calculate the less-than Green's function and the current density through the device. For a 2D device 
discretized as an x Ny mesh, the best known algorithms have a running time of 0{N^Ny), whereas FIND 
only requires 0{N^Ny). Even though this complexity has been reduced by an order of magnitude, the matrix 
inverse calculation is still the most time consuming part in the simulation of transport problems. We could 
not reduce the order of complexity, but we were able to significantly reduce the constant factor involved in 
^ the computation cost. By exploiting the sparsity and symmetry, the size of the problem beyond which FIND 

^ is faster than other methods typically decreases from a 130 x 130 2D mesh down to a 40 x 40 mesh. These 

(yr^ improvements make the optimized FIND algorithm even more competitive for real-life applications. 
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Main Notations 



M the set of aU the nodes in the mesh, p. [5] 

T the mesh is subdivided using nested dissection; T is the resulting 

tree of clusters, p. [5] 

C, Cg C is a set of mesh nodes after the partitioning. It is sometimes 

called a cluster. It may designate a cluster at any level in the 
tree. A specific cluster is denoted as Cg, where g is the label used 



when partitioning the whole mesh, as shown in Fig. 1(a) 
C, C_g C is the complement of C with respect to M and C_g = Cg as 
shown in Fig. 1(b) p. [5 



B, Bg set of boundary nodes of a cluster, Bg is the set boundary nodes 

of Cg and sometimes called the boundary set of Cg. P- [T] 

Bl , Qr sets of boundary nodes originated from the left child cluster and 
the right cluster, p. [T3| 



I, \g set of inner nodes. Ig is the set of inner nodes of Cg. Ig n Bg = 

and Ig U Bg = Cg. p. [7] 
S, Sg set of private inner nodes. Sg is the set of private inner nodes of 

cluster Cg. Sg C Ig. p.[7| 
Si, Sn private inner nodes originated from the left child cluster and right 
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cluster, p 

s, b size of the set S and B, p 

N,Nt.,N., size of the mesh from the 
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Hiscretization of 2D device, p. |4j For 
square mesh, — Ny and we write it as N , p. [Sl 
n size of square matrices A, S, G, and G^. For aniVa; x Ny mesh, 

n = N,Ny. p.|4] 

A, A'f the sparse matrix from the discretization of the device and its 

conjugate transpose, p. l3| 
G*", G< matrix associated with the retarded Green's function p. [3] 
S matrix associated with the self energy, p. l3| 

L, U the LU factors of the sparse matrix A, p.l4| 

Lg we decompose the LU factorization of A into multiple partial 

factorizations: L^J^^Lg. p.|4] 
Ag, Ag+ the matrices in the elimination process right before and after 

eliminating Sg, respectively, p. [7] 
Sg, Sg+ we keep updating S as we factorize A and use a subscript to 

indicate the stage in the process. 
C, Cg £g = Lg(Bg, Sg) is thc ouly non-zero off-diagonal block of Lg used 

for the update step for eliminating Sg. Sometime it is simplified 

as C when the subscript g is not important. P-l7| 
U, Ug hi g is a simplified notation for Ag+(Bg, Bg). We Keep computing 

lAg when we calculate G*" . Sometimes lAg is further simplified as 

U when the subscript g is not important, p. [71 
TZ. TZg 7?.g is a simplified notation for Sg+(Bg, Bg). We keep computing 

Rg when we calculate G^. Sometimes TZg is further simplified 

as TZ when the subscript g is not important. P-[9] 
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1 Introduction 



In recent years, the non equilibrium Green's function formalism [H El H] (NEGF) has been widely used 
to study the properties of nanoscale MOS (metal-oxide-semiconductor) transistors as well as nanowires and 
molecular electronic devices in which quantum effects are important. This formalism leads to the definition of 
two types of Green's function, the retarded Green's function and less-than Green's function. Their discretized 
approximations satisfy the following equations fS": 



A = Ho(£) + S'-(f) 



where £ is the energy. Hp is the Hamiltonian representing a single particle, S*" is the self-energy used to 
model the contact of the nano-device (nano-transistors) with the (infinite) source and drain, and [C"]^ is 
the conjugate transpose of C. S is the scattering self-energy matrix. It is typically assumed that S is a 
diagonal matrix, even though we will see later on that this assumption can be relaxed for our approach. The 
charge density can be obtained from the diagonal entries of G<(£). In a typical calculation, one needs to 
solve a coupled Schrodinger-Poisson system of equations using some iterative scheme. Each calculation of 
the diagonal of G<(£) is relatively fast, however the total computational cost can be very large since the 
charge density needs to be computed at all energy levels £, and at each iteration of the Schrodinger-Poisson 
equation solver. 

In this paper, we will focus on discretization schemes for the Hamiltonian operator Hq that use a 2D 
Cartesian grid. The method can be extended to more general schemes including finite-element but the 
description is easier in the context of Cartesian grids. We will further assume that a 5 point finite-difference 
stencil is used, although the method can be extended to arbitrary stencils. This stencil is the most common 
for these types of problems. Finally even though the matrix is not symmetric we require that A^j <^ 
Aji ^ 0. All these requirements can be removed but these assumptions will simplify the discussion of the 
method. 

In previous publications [SI [3 [S], we discussed an efficient method, FIND (Fast Inverse using Nested 
Dissection), to calculate the diagonal of C — A^^. The novel contributions in this paper include: 



Extension of FIND 017113 ^oi computing the diagonal of G< = A^^SA^^ See Section 3.1 



Extension for computing the off-diagonal entries of C. This is required for computing the current 



density (see for example |8], eq. (20) pp 7847). This is presented in Section 3.2 



Optimizations using certain sparsity patterns in A to further reduce the computational cost (Sec- 



tion 4.1) 



• Optimizations using the symmetry of A (Section |42j). 

• Optimizations specific to computing the off-diagonal entries for current density calculations. 

• Proof of symmetry and/or positive definiteness for matrices arising in the algorithm, when A is itself 



symmetric and/or positive definite. See 4.2 



• Detailed analysis of the computational cost and storage requirement for the different approaches. 

Pseudo-codes are given for the key algorithms. Numerical results with accuracy and performance benchmarks 
are given at the end (Section [S]). The performance improvements vary but are on the order of 2x in general. 
In the appendix, we derive a series of technical results that are required to prove some of the statements in 
this paper. 

Before beginning, we make some general comments regarding our approach and how it relates to other 
techniques. In [21 [10], methods are proposed to calculate the inverse of a dense matrix, of size iV x A^, in 
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0{N\nN) steps using low rank approximations of certain off-diagonal blocks. Similar techniques include 
?^^-matrices, quasi-separable matrices, semi-separable matrices, and HSS-matrices [TTl [T2 US] . Mamaluy 
et al.'s method uses the special dependency of the problem on £ to reduce the computational cost [T4 | [T5 | 
UM [T71 IT51 [TO] . This last method requires computing the eigenvectors and eigenvalues of the decoupled 
Hamiltonian. In |20| . a method is proposed in cases where the inverse matrix exhibits a decay property, i.e., 
when many of the entries of the inverse are small. In this paper, we seek to find exact methods that do not 
require special matrix properties or solving an eigenvalue problem. 

An efficient approach in this second class of methods is the recursive Green's function (RGF) method [21] 
(see also the earlier work of Erisman et al. [22]) with running time of order 0{N^Ny), where and Ny 
are the number of points on the grid in the x and y direction after discretization, respectively {N^ < Ny). 
This technique was generalized in [53] to block tri-diagonal matrices. RGF type methods perform a forward 
recurrence, based on LU factorizations. This produces a single entry in the inverse. From this entry, a 
backward recurrence progressively computes the remaining entries. In contrast, FIND [6j[7|[5] performs two 
elimination passes, based on LU factorizations, typically described as upward elimination (similar to the LU 
factorization in RGF), and a downward elimination (which has no equivalent in RGF). At the end of the 
process, the inverse is directly computed as a by-product of the downward elimination. The work of Lin et 
al. [231 US] follows an approach similar to FIND. 

As a note, we mention that the algorithm presented below uses a decomposition of the problem based 
on nested dissection. The original nested dissection of George et al. [26], which is used to solve linear 
systems, requires defining separator sets. A separator set is a set of mesh nodes such that it divides the 
mesh in two sets Ci and C2, and = if i € Ci and j € C2. An important difference between nested 
dissection and FIND is that, in the original nested dissection, a separator set is "shared" by the neighboring 
clusters. In FIND, since we need more independence among clusters, each cluster has its own boundary 
set that separates it from other clusters. These boundary sets are similar to the separator sets (generally 
called width-1 separators) used in nested dissection, but form so-called width-2 separators, with basically 
one "separator" for each of the two neighboring clusters. See for example [571 UHl [Ml 130] for a definition 
of width-Z separators. Width-Z separators with I > 1 are sometimes called wide separators. However, it is 
possible to construct a FIND algorithm that uses the more classical width-1 separators. Currently, FIND 
belongs to the class of right-looking algorithms (see 01] page 426, 427). A variant of FIND requiring only 
width-1 separators can be created using a left-looking strategy for the additions}^ Even though the latter 
might be more efficient, in this paper, we focus on the original FIND scheme, which uses a right-looking 
approach with width-2 separators. 



2 Previous work 

2.1 Brief description of the FIND algorithm for = 

The basic idea of the FIND algorithm is to simultaneously perform many LU factorizations on the given 
matrix to obtain the diagonal elements of its inverse. By performing the LU factorizations in a specific order 
that minimizes fill-ins 2B], we preserve most of the sparsity of the given matrix A, and thus make the LU 
factorizations very efficient. Once an LU factorization is complete, we can easily compute the last entry 
on the diagonal of the inverse: for an n x n matrix (C)^^ = LU, we have GJ^„ = 1/U„„. Although we 
can compute only the (n, n) entry of the diagonal by this formula, we can choose any node and reorder the 
original matrix to make that node correspond to the (n, n) entry of the reordered matrix. In this way, all 
the diagonal elements of C can be computed. 

If we have to perform a full LU factorization for each of the n reordered matrices, the algorithm will not 
be computationally efficient even though each LU factorization is very fast. However, we can decompose each 
LU factorization into partial LU factorizations: A = Lg Yin Ug, where Lg corresponds to the elimination 



of Sg and Sg is a set of private inner nodes that we will discuss in Section 2.2 Due to the sparsity pattern of 



A, we can perform them independently and moreover, if we reorder A properly, many partial factorizations 



^Specifically, this variant uses right-looking multiplications and left- looking additions. 
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for different reordered matrices turn out to be identical. As a result, we can reuse the partial results multiple 
times for different orderings thereby reducing considerably the computational cost. 



More specifically, we partition the mesh into subsets of mesh nodes. Fig. 1(a) shows the partitions of 
the entire mesh into 2, 4, and 8 clusters, for a 2D rectangular grid. Then, these subsets are merged to 
form larger clusters. Each merge corresponds to a partial Gaussian elimination. The clusters shown on the 



left panel 1(a) are called basic clusters and usually denoted as C. They are labeled with positive integers 
therefore sometimes called positive clusters. For reasons we will see soon, we also consider the complement 
of the basic clusters with respect to the whole mesh M. These clusters are called complement clusters and 
denoted as C. They are labeled with negative integers therefore sometimes called negative clusters as shown 
in Fig. [T(b)] 
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(a) Partition into basic clusters. 
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(b) Partition into basic and 
complement clusters. 



Figure 1: Partitions of a 2D rectangular mesh into clusters 

We organize these basic and complement clusters in trees to see how the partial eliminations are related 
to the full elimination. Fig. 2 (a) [ shows how the basic clusters form a basic cluster tree, usually denoted as 
T. For each of the leaf clusters (target cluster) in the figure, we order the matrix A, such that the entries 
corresponding to the target cluster (target entries) form a block at the end of the matrix. We then eliminate 
all the columns corresponding to the complement of the target cluster to compute the target entries in A~^. 




(a) The basic cluster tree of a mesh. (b) The complement-cluster tree of a mesh. 

Figure 2: Cluster trees. 

Because the complements of the target clusters overlap significantly, we organize these clusters, together 

, ^ Unlike the 

clusters in Fig. 



2(a) 



with other complement clusters, into a similar complement cluster tree, as shown in Fig. 2(b) 

a parent cluster in Fig. |2(b)| is not a union of its two child clusters, but rather their 



intersection. 

The two cluster trees in Fig. [3] show how two leaf complement clusters —14 and —15 are computed by the 
FIND algorithm. Such trees are called augmented trees and are denoted as T+ with r indicating the target 
cluster, i.e., the complement of the root cluster. Each tree is associated with an ordering of A with the target 
entries appearing at the end of the matrix. The path (— 3)-(— 7)-(— 14) in Tjj and the path (— 3)-(— 7)-(— 15) 
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Figure 3: Two augmented trees with target clusters 14 and 15. 



At each level of the two trees, two clusters are merged into a larger one. We define the inner nodes of 
a cluster C as the subset of nodes I in a cluster such that Ay = for any i G I and j ^ C. Each merge is 
equivalent to the Gaussian elimination of all the inner nodes of the resulting cluster. For example, in Fig. |4j 
the large cluster has been surrounded by a solid line with tics. The two children clusters are on the left and 
right. During earlier steps in the method, the x nodes have been eliminated. At the current step, the o 
nodes are being eliminated. As a result, all the inner nodes of the clusters have been eliminated at the end 
of this operation. 
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remaining nodes 
nodes already eliminated 
inner nodes to be eliminated 
boundary nodes isolating inner 
nodes from remaining nodes 



Figure 4: The nodes within the rectangular frame form a large cluster. Two child clusters are shown on the 
left and right. 

Such a merge corresponds to an independent partial LU factorization since there is no connection between 
the eliminated nodes (o) and the remaining nodes (a). This is better seen below in the Gaussian elimination 
in ([T]) for the merge in Fig. |4] 

A(o,o) A(o,.) \ /A(o,o) A(o,.) \ 
A(.,o) A(.,.) A(.,A) A*(.,.) A(.,A) , 

A(A,») A(A, A)/ \ A(A,») A(A,A)/ 

where A*(.,.) = A(.,») - A(., o)A(o, o)-1A(o, •) (1) 

In ([I]), o and • each corresponds to all the o nodes and • nodes in Fig. |4| respectively. A*(», •) is the result 
of the partial LU factorization. It will be stored for reuse when we go through the cluster trees. 

To make the most of the overlap, we do not perform the eliminations for each augmented tree individually. 
Instead, we perform the elimination based on the basic cluster tree and the complement-cluster tree in the 
FIND algorithm. We start by eliminating all the inner nodes of the leaf clusters in Fig. |2(a)| followed by 
eliminating the inner nodes of their parents recursively until we reach the root cluster. This is called the 
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upward pass. Once we have the partial ehmination results from upward pass, we can use them to traverse 
the complement-cluster tree level by level, from the root to the leaf clusters. This is called the downward 
pass. 



2.2 Formal description of the FIND algorithm 



Section |2.1| briefly illustrated how the algorithm works by showing that the eliminations of inner nodes are 
independent of each other (Eq. ([T])) and such eliminations are mostly shared among different augmented 
trees (Fig.jsl). Below we will show in a more formal way why the basic FIND algorithm works. 

In Fig. |4j the • nodes are called the boundary nodes, denoted as B^. The inner nodes consist of the o and 
X nodes, denoted as 1^. We call the o nodes private inner nodes and denote them as S^. Private inner nodes 
are the nodes that actually get eliminated when we process the cluster. To guarantee the independence of the 
eliminations shown by ([l]) and the overlap of the eliminations for different target nodes illustrated by Fig. |3j 
we need to order the original matrix A such that all the columns corresponding to the private nodes of each 
cluster in Fig. [l]are grouped together. It can be shown that for each target cluster, the whole mesh can be 
partitioned as a series of sets of private inner nodes (S^), the set of boundary nodes Bf of the complement 
of the target cluster, and the target cluster C^. We order the columns in A based on the ordering of these 
sets: Si of the leaf clusters first, followed by Si of their parents. Si of the clusters in the next level in the 
augmented tree, . . ., Sf, Bf, and finally C^. We call such ordering a consistent ordering with respect to a 
target cluster. Note that the consistent ordering is not unique since we do not require any specific ordering 
of clusters at the same level in the augmented tree. Here is an example of a consistent ordering with respect 
the target cluster 14: Ss, Sg, Sio, Sn, S4, S5, S12, S13, Sg, Sg, S7, S15, S14, B14, and C14. 

For notation convenience, we write Si < Sj if Si appears before Sj in a ordering and write Us <SiSj as S<i 
and (Usi<SjSj) U B_r U C,. as S>i. We also denote as Ag the matrix A after all the columns corresponding 
to the private nodes appearing before Sg have been eliminated and denote as Ag_|_ the matrix with all the 
columns before and including Sg have been eliminated. When necessary, we write them as A^^g and Ar^g+ 
to indicate explicitly the target cluster r of the elimination process. Now, we can rewrite the one-step 
elimination illustrated by ([l]) more formally as: 



A„ = 



^Ag(S<g, S<g) 




V 



<gi Sg) 



Ag(S 

A, 

Ag(Bg, Sg) 





A<,(S<g 



gi Sg) 



A 
A 



s(Sg7 

Bg) 



,Bg) 

B,) 



Ag(S 



c 



\B.) \ 



Ag(Bg,S>g\Bg 



Ag(S'>g\Bg, Bg) Ag ( S > g \ Bg , S'> g \ B g ) / 



A<,+ = 



/Ag(S<g,S<g) 




V 



Ag(S 



<9' 



Sg) Ag(S 



Ag(Sg, Sg) 






<SJ Bg) 



Ag(Sg,Bg) 

Ag+(Bg, Bg) 



Ag(S<g,S>g\Bg) 



Ag(Bg,S>g\Bg) 



Ag(S>g\Bg, Bg) Ag ( S > g \ B g , S > g \ Bg ) / 



where 



Ag+(Bg, Bg) 



Ag(Bg, Bg) 



Ag(Bg, Sg)Ag(Sg, Sg) Ag(Sg,Bg 



(2) 



For notation simplicity, we write Ag4.(Bg, Bg) as Ug. This is the block we keep computing in our algorithm. 
Also recall that A — JJ^ Lg Yl^ Ug, where Lg corresponds to the elimination of Sg. So we have Ag+ = L~^Ag 
with L~-^(Bg, Sg) = — Lg(Bg, Sg ) = " Ag(Bg, Sg)Ag(Sg, Sg ) " ^ the ouly nou-zcro off-diagonal block of Lg (see 
Li ct al. for more details) [5,. We can further define 



— Lg(Bg, Sg), 



(3) 



and then we can write ^ as 



Ag+(Bg, Bg) - Ag(Bg, Bg ) - £gAg{Sg, Bg). 



(4) 
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Note that in the above matrices, for notation convenience, Ag(», Bg) is written as a block. In reality, 
however, it is usually not a block in A of any ordering in our algorithm because the columns of A are ordered 
based on Sg but not Bg. It can be shown that the matrix preserves the sparsity pattern required by the 
above formula during the elimination process, which leads to the following theorem: 

Theorem 1 For any target clusters r and s such that Cg G and Cg G T+, we have 

Ar.g+{Bg, Bg) = As,g+(Bg, Bg ) . 

Theorem [l] shows that the partial elimination results are common for matrices with different orderings during 
the elimination process. This is the key foundation of the basic FIND algorithm. For more details, please 
see |S]. 



2.3 Cost analysis of the FIND algorithm for A ^ 

For analytic simplicity, we assume that A comes from a square grid of size N x N . In this case, the total 
cost is 0{N^). This is the same computational cost in the O sense as the nested dissection algorithm of 
George et al. [35] . It is now apparent that FIND has the same order of computational complexity as a single 
LU factorization, even though it calculates the diagonal of the inverse matrix. 

Similar analysis tells us that the total memory cost is C(A^^ log(iV)). Both costs are asymptotically 
better than those for the best known algorithm, RGF, given by Svizhenko et al. [21]. However, the constant 
factor in RGF [35] is smaller than ours [S]. As a result, with the chosen tree decomposition, we are slower 
than RGF for meshes smaller than 130 x 130. The single-layer-separator FIND variant discussed in the 
introduction would make FIND as fast as RGF even for small matrices. The following two sections will 
explain how to exploit the sparsity and symmetry to reduce the constant factor in FIND. 



3 Extension of the FIND algorithm 

In addition to computing the diagonal entries of the matrix inverse, the algorithm can be extended to 
computing the diagonal entries and certain off-diagonal entries of = A^^SA^^, which are required for 
the charge and current densities. 

3.1 Computing the diagonal entries of 

Intuitively, A^^SA^^ can be computed in the same way as A~^ because we have 

G< = A-iSA-t 
=> AG<At=S 
=> UG<Ut = L-iSL-t 
^ [G<]„„= (U„„)'i(L"'SL-t)„„(U„„)-t 



Now, it remains to show how to compute (L~^5]L~^)„„ efficiently. In the appendix, we show that if 
we order S in a same way as A, the pattern of S will be similar to that of A during the update process. 
Therefore, we can decompose the computation for the last block of L~^5]L^^ into multiple steps with each 
step involving only a few small blocks. More specifically, we have a sequence of matrices Sg that start from 
S and end at (L^^SL^'f)„„. This sequence is synchronized with the updates on A, and similar to the 
update Ag+ = Lg ^Ag, we have Sg+ = L" ^SgLg'''. This one-step update is given by: 



^9 



Sg(S<g, S<g) g {S < g , S g) Sg(S<g, Bg) 

S5(Sg,S<g) 12g{Sg,Sg) Sg(Sg, Bg) 

Sg(Bg, S<g) Sg(Bg, Sg) Sg(Bg, Bg) S g ( Bg , S > g \ B g ) 

\ Eg(S>g\Bg) S g ( S > g \ Bg , Bg ) S g ( S > g \ B g , S > g \ Bg ) ^ 



(6) 
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where 



9{S<9,S<g) Sg(S<g,Sg) Sg+(S<g,Bg) Sg ( S < g , S > g \ Bg ) 

^g{Sg,S<g) Sg(Sg,Sg) Sg+(Sg,Bg) 

g+(Bg,S<g) Sg+(Bg,Sg) Sg+(Bg,Bg) Sg ( B g , S > g \ Bg ) 

^9(S>9\Bg) Sg(S>g\Bg, Bg) S g ( S > g \ Bg , S > g \ Bg ) , 

Sg+(Bg, Bg) = Sg(Bg, Bg) " £gSg(Sg, Bg) 

- ^9(^5, Sg)£], + £gSg(Sg, Sg)£],, 

The difference between the updates on A and the updates on S is that the latter will not only change 
Sg(Bg, Bg), but rIso Sg(S<g, Bg) Rud Sg(Bg,S<g). Thcsc bloclcs are underlined in ([6|. The reason for these 
changes is that we multiply S by L^^ from both sides and the subdiagonal blocks of Sg are not cleared 
through the update process. The changes to Sg(S<g,Bg) and Sg(Bg,S<g), however, do not invalidate the 
update process because in any future update step h > g, we are only interested in the operation on the 
submatrix S/j(S>;i, S>;i) and S<g n S>h = 0- See the appendix for a rigorous description and proof. 

Note that even though ^r,g+ generally depends on r (corresponding to the ordering for different target 
clusters), Sr,g+(Bg, Bg) does not, as shown in the appendix. For notation simplicity, wc write Sg-|_(Bg, Bg) 
as TZg. The major part of the computation for is to compute TZg (for G^), along with Ug (for both G'' 
and G^), for all clusters. 



3.1.1 The algorithm and pseudo-codes 

Let Ci and Cj be the two children of Cg in T+. To compute TZg, we need both Ag(Sg U Bg,Sg U Bg) and 
Sg(Sg U Bg,Sg U Bg). Ag(Sg U Bg,Sg U Bg) foUows the Same update rule as in the basic FIND algorithm; 
and Sg(Sg U Bg. Sg U Bg) is given by the following four blocks: 

/s,(SgnB„SgnB,) s(SgnB„SgnB,) 
^a[^g,^9) l^s(SgnB„SgnBO (Sg n B„ Sg n B,) 

/s,(SgnB„BgnB,) 
^9[^9^^9)-[ s,(SgnB„BgnB,) 

/s,(BgnB„SgnB,) 
^^^'^^"1 s,(BgnB„SgnB,) 



and 

^9(69, Bg) 



S,(Bg n B,, Bg n B,) S(Bg f] B„ Bg f] Bj) 

S(Bg n B,, Bg n B,) S,(Bg H B,, Bg n B,; 



To compute G^, we start from the given A and S and keep applying the update rule ([t]). Similar to the 
basic FIND algorithm, if Cg is a leaf node, then we have Sg(Bg U Sg, Bg U Sg) = S(Bg U Sg, Bg U Sg), i.e., 
we can use the entries in the original S (see appendix) . The recursive process of updating S is the same as 
updating A in the basic FIND algorithm, but the update rule is different here. 

The pseudocode of the algorithm is an extension of the basic FIND algorithm. In addition to rearranging 
A and computing matrices Ug , we need to perform similar operations for S and compute matrices TZg . Note 
that even if we do not need A~^, we still need to keep track of the update of A, i.e., the U matrices. This 
is because we need Ag(Bg U Sg, Bg U Sg) to update Sg(Bg, Bg) and obtain TZg. Once we have prepared TZg 
for all the positive clusters, we can compute TZg for the negative clusters. This is done in the downward 
pass as shown in the procedure updateAdjacentNodes below. The whole algorithm is shown in procedure 
computeG^. In these procedures, we use slightly different notations. Instead of using Cg, Bg, and Sg for 
the sets of boundary nodes and private inner nodes, we use C, Be, and Sc. We also use Uc and TZc for the 
corresponding Ug and TZg , respectively. 
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Procedure updateBoundaryNodes(cluster C). This procedure is called with the root of the tree: 
updatcBoundaryNodes(root) . 

Data: tree decomposition of the mesh; the matrix A. 

Input: cluster C with n boundary mesh nodes. 

Output: all the inner mesh nodes of cluster C are eliminated by the procedure. The n x n matrices 
Uc for C and TZc for are computed and saved. 

1 if C is not a leaf then 

2 CI = left child of C; 

3 C2 = right child of C; 

4 updateBoundaryNodes(Cl) /* The boundary set is denoted Bci */; 

5 updateBoundaryNodes(C2) /* The boundary set is denoted Bc2 */; 

6 else 

7 I Ac = A(C,C); 

8 if C is not the root then 
Ac = [Uci A(Bci, Bc2); A(Bc2, Bci) Uc2]; 

/* A(Bci,Bc2) and A(Bc2,Bci) are values from the original matrix A. */ 

Rearrange Ac such that the inner nodes of Bci U Bc2 appear first; 
Set Sg in Eq. Q to be the above inner nodes and Bg the rest; 
Eliminate the inner nodes to compute Uc by Eq. ([2| ; 

— Pci S(Bci, Bc2); S(Bc2, Bci) ^€2]', 
/* SCBci, Bc2) and S(Bc2j Bci) ai'e values from the original matrix E. */ 
Rearrange Sc in the same way as Ac; 
Compute TZc based on Eq. ([t]); 
Save Uc and TZc] 
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3.1.2 Computation and storage cost 

Similar to the computation cost analysis in |5|, the major computation cost comes from computing ([7|. 

The cost depends on the size of Sg and Bg. Let s = |Sg| and b = |Bg| be the sizes, then the computation 
cost for £gSg(Sg, Bg) Is ( ^ s'^ + s^b + sb^) flops. Since Cg is already given in the cost is reduced to sb^. 
Similarly, the cost for Sg(Bg,Sg)£j^ is also sb^ and the cost for £gSg(Sg, Sg)£^ is sb^ + s^b. So the total 
cost for Q is (3sb^ + s^b) flops. Note that these cost estimates need explicit form of A(B, S)A(S, S)^^ and 
we need to arrange the order of computation to have it as an intermediate result in computing ([2|. 

Now we consider the whole process. We analyze the cost in the two passes. 

For the upward pass, when two (a x a)-clusters merge into one (a x 2a)-cluster, b < 6a and s < 2a, so the 
cost is at most 360a^ flops; when two (a x 2a)-clusters merger into one (2a x 2a)-cluster, b < 8a and s < 4a, 
so the cost is at most 896a^ flops. We have (a x 2a)-clusters and ^^^^ (2a x 2a)-clusters, so the cost 

for these two levels is (SGOa^^^^ + 896a^^^^ ^)AOANxNya. Sum aU the levels together up to a = iV^/2, 
the computation cost for upward pass is then AOAN^Ny. 

For the downward pass, when one (2a x 2a)-cluster is partitioned to two (a x 2a)-clusters, b < 6a and 
s < 8a, so the cost is at most 1248a'^ flops; when one (a x 2a)-cluster is partitioned to two (a x a)-clusters, 
b < 4a and |Sg| < 6a, so the cost is at most 432a^ flops. We have -^72^ (a x 2a)-clusters and ^2 " (a x a)- 
clusters, so the cost for these two levels is (1248a^ ^20^" + 432a^ Af» _^ lObGN^Nya. Sum all the levels 
together up to a = Nj;/2, the computation cost for upward pass is then at most lObQN^Ny. 

3.2 Computing off-diagonal entries of and 

In addition to computing the diagonal entries of C and G^, the algorithm can also be easily extended to 
computing the entries in G*" and G^ corresponding to neighboring nodes. These entries are often of interest 
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Procedure updateAdjacentNodes (cluster C). This procedure is called with the root of the tree: up- 
date Adj acentNodes (root ) . 

Data: tree decomposition of the mesh; the matrix A; the upward pass [updateBoundaryNodes()] 

should have been completed. 
Input: cluster C with n adjacent mesh nodes (as the boundary nodes of C). 
Output: all the outer mesh nodes of cluster C (as the inner nodes of C) are eliminated by the 

procedure. The n x n matrices Uq and TZ^ are computed and saved, 
if C is not the root then 

D = parent of C /* The boundary set of D is denoted Bjj */; 

Dl = sibling of C /* The boundary set of Dl is denoted Bm */; 

Ac = [Uo A(Bd, Bdi); A(Bdi, Bq) Uqi]; 

/* A(Bj5,BDi) and ACBdi.Bi)) are values from the original matrix A. */ 
/* If D is the root, then D = and Ag = Rdi • */ 

Rearrange Ag such that the inner nodes of Bg U Bqi appear first; 
Set Sg in Eq. ([t]) to be the above inner nodes and Bg the rest; 
Eliminate the outer nodes to compute by Eq. ([2| ; 

= ["^D 5](Bd,Bdi); S(Bdi,Bd) TZdi]; 
Rearrange in the same way as for Ag; 
Compute TZ^ based on Eq.([7]); 
Save and TZq ; 

12 if C is not a leaf then 
CI = left child of C; 
C2 = right child of C; 
update Adj acentNodes (CI); 
updateAdjacentNodes(C2); 



Procedure computeG^(mesh M). This procedure is called by any function that needs G< of the whole 
mesh. 

Input: the mesh M; the matrix A; the matrix S. 

Output: the diagonal entries of G^. 

1 Prepare the tree decomposition of the whole mesh; 

2 updateBoundaryNodes(root); 

3 updateAdjacentNodes(root); 

4 for each leaf node C do 
Compute [A"i](C, C) using Eq. (|8|; 
Compute Gc based on Eq. ([s]); 
Save Gc together with its indices; 

8 Collect Gc with their indices and output all the diagonal entries of G<; 
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in simulation. For example, the current density can be computed using off-diagonal entries corresponding to 
horizontally neighboring nodes. These entries can be obtained with a slight modification of the algorithm. 
For simplicity, we only consider the off-diagonal entries of C. The off-diagonal entries of G< can be treated 
in a similar way. 
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(a) A leaf cluster with its neighboring 
nodes. The arrows correspond to the en- 
tries for the current density that are being 
computed. 



1 2 3 4 5 6 7 8 9 10 11 12 



(b) Matrix entries. 



Figure 5: Last step of elimination. The entries in the shaded area are obtained through computing C. The 
solid blocks and the patterned blocks are the off-diagonal entries for the current density. The solid blocks are 
obtained for free when we compute the diagonal entries of C. The patterned blocks need to be computed 
separately. 

Fig. [5] shows a small number of mesh nodes and their corresponding matrix entries for the computation 
of the diagonal entries C and the current density. The entries of = A^^ corresponding to the nodes in 
C are being computed with the following equation: 

[A-i](C, C) - [A(C, C) - A(C, Bc)(Wc)-iA(Bc, (8) 

The current density between nodes 9 and 11, and that between nodes 10 and 12 are also given by ([s]). To 
compute the current density between nodes 11 and 3 and that between nodes 12 and 4, perform one step of 
back substitution and we have 

[A-i](Bc, C) = -(Wc)"'A(Bc, C)[A-i](C, C) (9) 

and 

[A-i](C,Bc) = -[A-i](C,C)A(C,Bc)(Z^c)"'- (10) 
This can be generalized to nodes that arc farther away. 



4 Optimization of FIND 

In our FIND algorithm, we achieved 0{N^Ny) computation complexity for a 2D mesh of size x Ny. Even 
though this complexity has been reduced by an order of magnitude compared to the state-of-the-art RGF 
method, the matrix inversion is still the most time consuming part in transport problem simulations. This 
chapter discusses how to further reduce the computational cost by using certain properties of the matrix A 
such as its sparsity pattern. 
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In the FIND algorithm, the major operation is performing Gaussian ehminations. AU such operations 
are of the form 

U = A(B, B) - A(B, S)A(S, S)-^A{S, B) (11) 



for C and 
for G<, where 



n = S(B, B) - £I](S, B) - S(B, S)d^ + £S(S, S)C 



(12) 



£ = A(B,S)A(S,S)-^ (13) 
These equations are copied from ([2| and ([7| with subscripts skipped for simphcity|^ As a resuh, computing 



(111 and (12) efficiently becomes critical for our algorithm to achieve good overall performance. 



Simple analysis [51 shows that the computation cost for (111 is 

^s^ + s^b + sb^ 
3 



and for ( 12 ) is 



s^b 



3sb'' 



(14) 



(15) 



where s and b are the size of sets S and B, respectively. These costs assume that all the matrices in (11 1 and 



(12 1 are general dense matrices. These matrices, however, are themselves sparse for a typical 2D mesh. In 



addition, due to the characteristics of the physical problem in many real applications, the matrices A and S 
often have special properties |21j . Such sparsities and properties will not change the order of cost, but may 
reduce the constant factor of the cost, thus achieving some measure of speed-up. With proper optimization, 
the FIND algorithm can exceed other algorithms on a much smaller mesh. 

In this section, we first exploit the sparsity of the matrices in ([2| and ([t]) to reduce the constant factor 
in the computation and the storage cost. We then consider the symmetry and positive dcfinitcness of the 
given matrix A for further performance improvement. Finally, we also apply these optimization techniques 
to and current density, when S has similar properties. 



4.1 Optimizations that depend on the sparsity of A 

Because of the local connectivity of the 2D mesh in our approacfj^ the matrices A(B, S), A(S, S), and A(S, B) 
in (111 are not dense. Such sparsity will not reduce the storage cost since the matrix U in ([Tl|) is dense and 



this is the major part of the storage cost. However, the computational cost can be significantly reduced. To 
clearly observe the sparsity and exploit it, it is convenient to consider these matrices in blocks. 



4.1.1 Schematic description and analysis of the sparsity pattern 

As shown earlier in Fig. [4j we eliminate the private inner nodes when we merge two child clusters in the tree 
into their parent cluster. In Figs.[6j we distinguish between the left and right clusters in the merge (compare 
with Fig. |4|^ 

In Fig7]6j the three hollow circle nodes and the three hollow square nodes are private inner nodes of the 
parent cluster. They originate from the left child cluster and the right child cluster and are denoted as Sl 
and Sfl, respectively. When we merge the two child clusters, these nodes will be eliminated. Similarly, the 
solid circle nodes and the solid square nodes are boundary nodes originated from the left and the right child 
clusters and are denoted as B^ and B/f, respectively. When we merge the two child clusters, these nodes will 

^We use loose notations here since some of the entries of A(S, S) and S(S, S) have been modified by the Gaussian elimination 
compared to the original A and S. See Section |3.1| for details. 
^The approach works for 3D mesh as well 
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Figure 6: Two clusters merge into one larger cluster in the upward pass. The x nodes have already been 
eliminated. The o and □ nodes remain to be eliminated. This figure is slightly different from Fig. |4] as we 
distinguish the nodes in the left and right clusters here. 



be updated. The matrix blocks corresponding to the circle and square nodes can be written as 

\ 



/u(o,o) 
A(n,o) 
u(«,o) 

V 



A(o,n) 

U(D,D) 


u(B,n) 



u(o,«) 



u(«,«) 

A(B,«) 





u(n,i 

A(«,l 
U(H,I 



(16) 



2.2 



and the merge corresponds to the elimination of the first two columns of ( 16 ). If we follow the notation used 
then (16) corresponds to Ak{Sk U Bfc,Sfc U where cluster k has two child clusters i and 



in Section 

j, and the circle and square nodes are given by 
3fcnB 



O: 


Sl 


□ : 


Sr 


• : 


Bl 


■ : 


Br 



I, private inner nodes from the left cluster i 
j , private inner nodes from the right cluster j 



= Bfc n B 
= Bfc n B 



boundary nodes from the left cluster i 
■j , boundary nodes from the right cluster j 



In ( 16 1 we use U instead of A for some blocks to emphasize that they are results from earlier eliminations 
while the other blocks come from the original matrix Aj^ These U blocks are typically dense, since after the 
elimination of the inner nodes, the remaining nodes become fully connected. In contrast, the A blocks are 
typically sparse. For example, since each O node is connected with only one □ node in Fig. [6j A(0, □) and 
A(n,0) are both diagonal. A(#,H) and A(H,#) are almost all except for a few entries, although such 
sparsity saves little cost because they are only involved in addition operations. 

These properties always hold for the upward pass. They hold for the downward pass as well but with a 
few exceptions. These exceptions are illustrated by the patterned nodes in Fig. [7] Since the exceptions only 
occur at the corner nodes, we can ignore them in the computational and storage cost analysis. 



^We use k instead of g here. 

^We use loose notation liere. The notations U for some of the blocks (e.g., U(« 



•)) is incorrect since this is not the final U 
matrix from the LU factorization, but rather an intermediate step in the method. This notation is closer to a computer code 
implementation. More precise notations however would reduce the legibility of the text. 
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Figure 7: Notations are the same as in Fig. [6j with some nodes patterned. The A nodes are not modified in 
this merging step. The patterned symbols correspond to nodes whose connectivity is different from Fig. |6] 
As a result, A(o, □) and A(n, o) are no longer strictly diagonal and A(o, ■) is not exactly zero any more. 



4.1.2 Exploiting the sparsity using a block structure 

We analyzed the computational cost using different formulas to calculate the inverse of a general 2x2 block 
matrix. The following forms of the inverse are available}^ 



CD) ^ \-D-^CA-^ b-^ 

A-^ + A-^BD-^CA-^ -A-^BD-^ 
-D-^CA-^ D-^ 

A ByW I 0^^^ 



bl \CA-^ I 



(17a) 
(17b) 
(17c) 



where A = A — BD and D = D — CA ^B arc the Schur complements of 13 and A, respectively. In (17a|, 
the calculations for A~^ and 13"^ are independent of each other and can be done in parallel, so we call 



this method parallel inverse. In (17b), we have to calculate D ^ first to calculate the other block of the 



inverse: A^^ + A^^BD^^CA^^, so we call this method sequential inverse. In (17c|, we perform block LU 
factorization first and then calculate the inverse, so we call this method block LU inverse. 

Details of our derivation can be found in the appendix. Using our analysis of the sparsity pattern, the 
cost for each approach is shown in the table below: 



Table 1: Summary of the optimization methods. 



Method 




Cost 




Reduction in four case^^ 


Parallelism 


parallel inverse 




+ 4m^7i -t 


- 4mn^ 


51.4, 52.6, 61.5, 60.1 


good 


sequential inverse 




+ 5m^n - 


- Amn^ 


53.0, 54.0, 55.8, 55.7 


httle 


block LU inverse 




+ 4m^n -\ 


- 5mn^ 


59.1, 57.9, 53.9, 54.3 


some 


naive LU 




+ ^m^n - 


■f Amn^ 


59.0, 62.5, 76.0, 74.4 


httlc 



Note that these methods are not exhaustive. There are many variations of these methods with different 
forms of A(S,S)^^, different computation orders, and different common parts in the computation. Deter- 



°We assume A and D to be nonsingular. 
''The percentage of the cost given by l|14| 
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mining which method is the best may depend on the size of Sl, Sr, B^, and Bji, the exceptions in the 
downward pass (corner nodes), the flop rate, the cache miss rate, and the implementation complexity. 



4.2 Optimization for symmetry and positive definiteness properties 

In real problems, A is often symmetric (or Hermitian if they are complex, which can be treated in a similar 
way) [21) . So it is worthwhile to consider special treatment for such matrices to reduce both computation 
cost and storage cost. Note that this reduction is independent of the optimizations based on the sparsity 
pattern in Section. |4.1[ 

If all the given matrices are symmetric, it is reasonable to expect the elimination results to be symmetric 



as well since our update rule (11) does not break matrix symmetry. This is shown in the following property 
of symmetry preservation: 



Property 1 (symmetry preservation) If A is symmetric, then in (11), U, A{B,B), anrf A(S,S) are all 
symmetric; A(B,S) and A(S, B) are transpose of each other. 

See |A] for a proof. 

In addition, A is often positive definite. This property is also preserved during the elimination process 
in our algorithm, as shown in the following property of positive-definiteness preservation: 

Property 2 (positive-definiteness preservation) If A is symmetric and positive- definite, then the ma- 



trix U in (11) is always positive- definite. 



See [B] for a proof. Using the symmetry preservation property and the positive definiteness preservation 



property, we can write the last term in ( 11 ) as 



A(B, S)A(S, S)-iA(S, B) = {G^^AiS, B)f{g^'A{S, B)), 



where A(S,S) = ^s^s" ^^^'^ Cholesky factorization of A(S,S). The Cholesky factorization has cost ^s^^ 



The solver has cost ^s^b, and the final multiplication has cost \sh^ . The cost for (11) is then reduced by 
half from + s^b + sb^ to ^s^ + ^s^b + ^sb^. 

Note that even if A is not positive definite, by the symmetry preservation property alone, we can still 



write the last term of (11) as 



A(B, S)A(S, S)-iA(S, B) = {C^^A{S, B)f-D^\C-^^A{S, B)), 

where A(S, S) = £sDsi3s is the LDL"^ factorization of A(S, S). 

Similarly, the computation cost is reduced by half. However, this method may be subject to large errors 
due to small pivots, since the pivots can only be selected from the diagonal entries [33,- The diagonal pivoting 
method [311 ISS] can be used to solve such a problem, but it is computationally more expensive. 

Combining symmetry and sparsity For symmetric and positive definite A, we can take advantage of 
the sparsity pattern discussed in Section [4?!] as well. Consider the following block Cholesky factorization of 
A(S,S): 

where An = t/iCjf , A22 = A22 - Af2Aj;/Ai2, and A22 = ^^2^2 • Now, we have 
r-iA^c: - f /A,(SfcnB„BfcnB,) 

- [Af,g^^ g2j\ A,(SfenB„BfenB,- 

^ ( " V (19) 

1-^2 Aj2^r ar'A(SL,Bi) ^2 AiSB,,BR) 
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and 



A(B, S)A(S, S)-iA(S, B) - {G^'AiS, B)f{g^'AiS, B)). 



(20) 



With these different optimizations, the total cost ends up being 



2m?'n 



^mn^. Compared to the 



original cost with a block structure, the cost is reduced by more than half when m = n. The cost with 
both optimizations, for sparsity and symmetry, is 25% of the original cost when the condition m « is 
satisfied. 



Storage cost reduction In addition to the computation cost reduction, the storage cost can also be 
reduced for symmetric A. Since the storage cost is mainly for U, which is always symmetric, the storage 



cost is reduced by half. Another part of the storage cost comes from the temporary space needed for (111, 



which can also be reduced by half. Such space is needed for the top level update in the cluster tree. It is 
about the same as the cost for U but we do not have to keep it. So for a typical cluster tree with ten or 
more levels, this part is not important. When the mesh size is small with a short cluster tree, this part and 
its reduction may be important. 



4.3 Optimization for computing and current density 

We can also reduce the cost for computing by optimizing the procedure for (12): 

7^ S(B, B) - /:S(S, B) - S(B, S)C^ + £I](S, S)C^ 



(21) 



We now consider optimizations that use the sparsity pattern of A and S, and separately the symmetry of 
A and S. Those optimizations can be combined but this will not be presented in this paper. 



4.3.1 sparsity 



Since S(S, B) and S(B, S) are block diagonal, the cost for the second and the third term in (21 ) is reduced 
by half. Similar to the structure of A(S, S) in ( 11 ), the blocks S(Sfc n B,, Sfe n Bj) and S(Sfe n Bj, Sfc n B^) in 
S(Sfc, Sfc) are diagonal. Consequently, the computational cost of operating with these blocks will be ignored 
since it is negligible. The cost for the fourth term £~^I](S,S)£" 
s^b + ^sb^, depending on the order of computation), 
sb " ~ 



|sb2) 



in (21 ) is then reduced to ^s^b + sb^ (or 

' ^ 2sb2 (or 

Compared with the cost without optimization for sparsity, it is reduced by 37.5% when s = b. 



So the total cost for (21) becomes 



In addition, S is often diagonal or 3-diagonal in real problems. As a result, S(S/c n Bi,Sfc n Bj) and 
S(Sfc n Bj, Sfe n Bi) become zero and Sfc(Sfc, Sfe) becomes block diagonal. This can lead to further reduction. 



4.3.2 G< symmetry 

For symmetric S, we have a property of symmetry and positive definiteness preservation for Sg+( 



Bg, Bg) 



Bg, Bg) 



and Eg(Sg,Sg). With such symmetry, we can perform a Cholesky factorization on S(S,S) 



The total cost for 



is reduced to gS'^ 



jsb . Compared to the cost of computation without 



exploiting the symmetry (?b + 3sb^), the cost is reduced by approximately 33% when s = b. 



4.3.3 Current density 

We can also exploit the sparsity when computing the current density, but it cannot improve much since there 
is not as much sparsity in (|8])-( 10 ) as in ( 11 ). This can been seen in Fig. |8] 

However, we can significantly reduce the cost with another approach. In this approach, we still compute 
the entries of G'' for every leaf cluster, but compute the current density only for half of the leaf clusters 
indicated by the solid nodes in Fig. 9(a) This approach is still based on ([8])~(10), but in addition to 
computing the current density for nodes 9-12, we also compute the current density for nodes 1 and 2, which 
incurs no extra cost. Therefore, it reduces the cost for the current density by half. 
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Figure 8: Sparsity of the matrices before the last step of elimination. Shaded blocks are non-zero entries 
and blank blocks are zero entries. 



Moreover, if for the chosen leaf clusters, we compute not only the off-diagonal entries for the current 
density, but also the diagonal entries for nodes 5-12, then we can completely avoid the last step of elimination 
and the computation thereafter for the other half of the leaf clusters in the final stage. Detailed analysis 
shows that computing both the diagonal entries of C and the current density with this approach is in fact 
even less costly than computing only the diagonal entries of C without this approach. Fig. [9] shows how we 
choose the leaf clusters to cover the desired diagonal and off-diagonal entries in C. Note that this approach 
(even though more efficient) was not implemented in our numerical benchmarks. 

We can also optimize (|8])-(10) by similar techniques used in Section 4.2 to reduce the cost for symmetric 
A and S. In 

A(C,C)-A(C,Bc)(Z^c)~'A(Bc,C) 



is of the same form as (11 1 so the cost is reduced by half. The cost of computing its inverse is also reduced 
by half when it is symmetric. Eqs. ([9| and ( 10 1 are transpose of each other so we only need to compute one 
of them. As a result, the cost for computing the current density is reduced by half using symmetry. 




O >• -o-o- 



(a) Coverage of all the needed current 
densities using half of the leaf clusters. 
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(b) Matrix entries. The notations here 
follow those in Fig. [s] 



Figure 9: Choice of leaf clusters and the corresponding matrix with desired entries marked. Each needed 
cluster consists of four solid nodes. The circle nodes are "skipped." Arrows correspond to the current 
densities. If we compute the entries in the shaded area and in the patterned blocks, then all the desired 
entries will be covered. See Fig. [8] for the node numbering. 
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4.4 Nodes on the physical boundary of the device 



Some of the nodes on the physical boundary of the device have a different connectivity. For example, a typical 
node in a 5 point stencil is connected to 4 neighbors. A node on the physical boundary of the device may be 
connected to only 3 neighbors. This reduces the computational cost of our method since those nodes can be 
eliminated early in the scheme leading to overall smaller boundary sets for some of the clusters. This has a 
small effect for leaf clusters (i.e., small clusters) but leads to a significant reduction in the computational cost 
near the root of the tree when we have few very large clusters. With such effect taken into consideration, 
the total cost of the algorithm is approximately 457iV^ (see the appendix). As a comparison, the cost using 
RGF with symmetry taken into consideration is Since the cross-point of the two cost curves is at N 

equal to the ratio between the constant factors in the two methods, it is expected to happen at ~ 130. 
This is consistent with what we observed in our simulations. 

Section Summary. The optimization for sparsity leads to 40% cost reduction and that for symmetry leads 
to 50% cost reduction. This results in a 70% cost reduction. In other words, the new cost is about 1/3 of 
the original cost. This is reflected in the cross-point of the performance curves using these two methods. 
Originally, the cross-point was around A^ ^ 130, but now it can be as low as A^ = 40. This shows the 
performance improvement resulting from our optimizations. The optimization for sparsity does not reduce 
the memory cost. The optimization for symmetry does reduce it (approximately by half). 



5 Numerical Results 



5.1 Extension 



In the following two figures. Fig. |10(a) gives the running time for computing based on the extension of 



FIND as described in Section |3] and the comparison with RGF. Fig. |10(b)| shows the same data in log-log 
scale to see more clearly the asymptotic behavior of these two methods. All the running times in these two 
figures are based on the extension of FIND before the optimizations discussed in Section |4j 




Figure 10: Comparison of running time based on FIND extension (before optimization) and RGF for 
computing 



5.2 Optimization 

Fig. |ll(a)] shows the computation cost reduction for clusters of different sizes during the upward pass, after 
optimization for sparsity. As indicated in Section [4. 1[ the improvements are different for two types of merge: 
i) two (a X a)-clusters ^ one (a x 2a)-cluster; n) two (2a x a)-clusters one (2a x 2a)-cluster, so they are 
shown separately in the figure. The reduction for small clusters is not as significant. This is because the 
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second order contributions to the cost (e.g., O^m'^)) was ignored in our analysis but is significant for small 
clusters. 



E 
o 
O 




2^> 2 2 2 2 

a 

(a) Optimization based on sparsity 



E 
o 
O 



two 2axa => one 2ax2a - 
two axa => one 2axa - 




2^ 2 2 2 2 2 

a 

(b) Optimization based on symmetry 



Figure 11: Comparison between RGF and FIND after optimization 



Fig 11(b) shows the computation cost reduction after optimization for symmetry. Similar to the Fig. 11 (a 
we show the reduction when merging clusters at each level in the basic cluster tree. We can see that the cost 
is reduced almost by half for clusters larger than 32 x 32. The smaller reduction for small clusters is mainly 



10' 




RGF data * 

RGF-fit 

Optimized FIND data x 
Optimized FIND-fit - - ■ 



32 40 48 



96 128 
N 



192 256 



Figure 12: Comparison with RGF to show the change of cross-point 

due to the fact that the Cholesky factorization does not provide savings by a factor of 2 for these clusters. 
Although the reduction in cost is smaller for small clusters using either optimizations, these clusters do not 
play a significant role for a mesh larger than 32 x 32 since the top levels of the tree in Fig. 2(a)| dominate 
the computation cost. 



Fig. 12 compares the computation cost of RGF and optimized FIND on a Hermitian and positive-definite 
matrix A and shows the overall effect of the optimization for sparsity and symmetry. The simulations were 
performed on a square mesh of size N x N. The cross-point is around 40 x 40 according to the two fitted 
lines. We confirmed this by adding a simulation at that point. 
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6 Conclusion and discussion 



We have presented an extension of the method described in [5] to the calculation on and the current 
density in NEGF models of nano-transistors. This extension is essential to apply the methods in [5] to real 
engineering problems. This extension assumes that S has a sparsity pattern similar to A. It is based on a 
nested dissection of the mesh and uses a form of Gaussian elimination. In the family of direct methods that 
provide exact solutions (vs approximate methods), it can be proved that the scaling of the computational 
cost with problem size using nested dissection is optimal. 

We considered the case of a typical finite-difference scheme for a two dimensional device. The sparsity 
pattern associated with the stencil allows special optimizations, which can reduce the computational cost by 
a factor of approximately 2. These optimizations are generally applicable to other types of finite-difference 
stencils, even though this extension was not discussed here. When E is sparser than A, for example if it is 
a diagonal matrix, further cost reductions are possible. 

We also presented optimizations for the cases where A is symmetric. In this case, the computational 
cost of many operations can be reduced using Cholesky type factorization. We also proved results regarding 
properties of positive definiteness for matrices arising in the process, for example lA. Those optimization 
techniques were derived for computing both G and G^. 

We described a simple approach to reduce the cost of computing the current, which uses an octagonal 
tiling of the mesh. This allows reducing the number of leaf clusters that are computed while yielding all the 
off-diagonal entries that are needed. 

Finally, regarding the nodes on the physical boundary of the device, we calculated their effect on the 
computational cost. The cost associated with clusters that share nodes with the physical boundary of the 
device is typically lower since the size of the boundary set of the cluster is smaller. Consequently, this reduces 
significantly the computational cost of operating on the clusters near the top of the tree. 

Numerical results confirm the computational cost analysis and show the speed-up that can be expected 
from applying these methods to real-life applications. 

We have not discussed the role that certain classes of fast methods like multigrid algorithms can play 
for our problem. Multigrid algorithms have been used successfully to solve a wide range of linear systems 
Aa: = 6 in linear time [551 137] . In the area of semiconductor process and device simulation, multigrid has been 
applied, for example, to the solution of various partial differential equations, including Lame equations (linear 
elasticity), reaction-diffusion (-convection) and drift-diffusion (-convection-reaction) systems [36] . In the 
NEGF area, multigrid has been used in the solution of the Kohn-Sham equations (specifically the associated 
generalized eigenvalue problem) [351 [Ml HOI E] • The computational cost of multigrid when solving a linear 
system is lower than the FIND algorithm. However, it is not known whether adapting multigrid from solving 
a linear system of equations Ax = 6 to computing in linear time the diagonal of G'' = A^^ and G"" S [G'']^ 
is possible. 
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In the appendix we provide additional technical results that were used in the main body of this paper. 

A Symmetry preservation 

Proof: This property holds for all the leaf clusters by the symmetry of the original matrix A. For any node 
in the cluster tree, if the property holds for its two child clusters i and j, then the property holds for their 
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parent node k as well because the blocks of A in ( 11 ) are given by 



, , /A,(BfenB„BfenB,) A(BfenB„BfenB,)\ 

^ , /A,(Sfc n B„Sfc n Bi) A(Sfe n B,;,Sfe n Bj) \ , , 

- (^A(Sfe n B„Sfc n B,) A,(Sfc n B^s^ n B.oj ' ^ ' 

^ (R - /^A,(BfcnB„s,nB,0 o \ 

^'=(^^■'^^■^-1, A,(B,nB„s..nB,)j' (22c) 

A R ^ /^A,(SfcnB„B,nB,0 \ 

^'^(^^'^^■^-l A,(S,nB,,B,nB,)j- (22d) 



□ 



B Posit ive-definiteness preservation 

Proof: Write n x n symmetric positive definite matrix A as ^ ^ and let An An — zz^ /an 

be the Schur complement of an. Let A(") A, A(i) aJ°\ . . . , A^""!) A^""^\ Note that since A 
is positive definite, a'^^I ^ and A('+^) is well defined for alH = 0, . . . , n — 2. By definition, A*^*+^^ are also 
all symmetric. 

Now, we will show that A*^^', . . . , A^"^^^ are all positive definite. Given any (n — 1) x 1 matrix y ^ 0, 
let a; = ^ ^ y^*^^^) ^ ^' ^^^'"'^ ^ symmetric positive definite, we have 

< x'^Ax 



^ H - Anj \ y ) 
(0 -yV/an + 2;^An)(-"Y"") 



= -y'^zz^y/aii + y^Any 
= y'^ [All - zz^ / aiijy 
= y^Any. 

As a result, A^^' is positive definite as well. Repeat this process to show that A^^^ A^""^) are all 
positive definite. 

Since any principal submatrix of a positive definite matrix is also positive definite |33) . every A(Sfc,Sfc) 
in (|2| is also positive definite. □ 

C Optimizations that use the sparsity of A 

The sparsity pattern was described in the main body of the paper. In this section, based on this sparsity 
pattern and different ways to calculate the inverse of a matrix, we reduce the number of floating point 
operations. 

Recall that we need to calculate the term: 

A(B,S) A(S,S)-i A(S,B) 

which is part of the update of hi: 

U = A(B, B) - A(B, S)A(S, S)-^A{S, B) 
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For notational simplicity, in this section, we write A(S, S) as a block matrix: 



A B\ /A(Sl,Sl) A(SL,Si^) 
CD U(Sk,Sl) H^r,Sr) 



write A(S, B) as 



and write A(B, S) as 



W 
Y 



/^A(Sl,Bl) 
U(Sfl,Bi) 



A(Sfl,B^) 



P Q\ /A(Bl,Sl) A(Bi,Si^) 
R S) \A{Er,Sl) A{Er,Sr) 

Since a multiplication is much more expensive than an addition, we ignore the addition with A(B, B). 

Before proceeding to the main analysis, we list some facts about the computational cost of various basic 
matrix computations. We indicate the cost of various operations (shown with =>) involving the n x n full 
matrices A and B (only multiplications are counted): 



A 

L 

U,L- 



LU : 

L-i : 
L-^B : 
A-^B 



n^/3 



Adding them together, computing A~^ requires rv^ and computing A~^B requires ^n^- The order of com- 
putation is often important, e.g., if we compute A^^B — {U~^L~^)B, then the cost is (n^ + rv" —) 2n^, 
whereas computing U~^{L~^B) only requires -'"^ 



We will consider different block forms for the inversely 



A-^ 

-D-^CA-^ 



A-^BD- 



A-^BD-^CA-^ 
-D-^CA-^ 
-1 



-A-^BD-^ 



/ 

CA-^ 



See (17a), (17b), and (17c). In (17a 



the calculations for A^^ and D 
can be done in parallel, so we call this method parallel inverse. 

calculate the other block of the inverse: A^^ + A^^BD^^CA~'^, so we call this method sequential inverse. 
In (17c), we perform block LU factorization first and then calculate the inverse, so we call this method block 



^ are independent of each other and 
In (17b), we have to calculate first to 



L U inverse. 



The first method is based on (17a I, the parallel inverse of the block matrix. It computes A(S,S) 



through the Schur complement of the two diagonal blocks of A(S, S). We multiply ( 17a) by A(S, B) on the 
right. The result of this multiplication is a 2 x 2 block matrix, whose blocks are shown in Table [2] The 
calculation of each block requires a set of operations listed in the Operations columns. The details of how to 
perform those operations in order to minimize the computational cost are given in Table [3] 

Table [3] gives the cost of each operation with different assumptions for the sparsity pattern of X and Y. 
The last column indicates the dependency between these operations, as some of the operations depend on 
the results of previous operations. 

In this table and the tables below, to show dependence, the operations are listed in the order of compu- 
tation. The size of A, B, C, and D is mx m; the size of W, X, Y, and Z is m x n. The different operations 
are labeled using a number inside a circle. This notation was also used in Table [2] 

^We assume A and D to be nonsingular. 
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Table 2: Matrix blocks and their corresponding operations 



Block Expression Operations 



(1, 1) 


A- 


^W- 


A-^BD-^Y 


® 




(1,2) 


A- 


^X- 


A-^BD-^Z 


® 


® 


(2, 1) 


-D 


-^CA 


-^W + D-^Y 


® 




(2, 2) 


D- 


-^CA- 


-^X + D-^Z 


@ 


® 



Table 3: The cost of operations and their dependence in the first method. The costs are in flops. The size 
of A, B, C, and D is m x m; the size of W, X, Y, and Z is m x n. 

Type of matrix blocks 



Operations 


all full 


X, Y = 


X, Y = 0; 
B. C = diag 


Dependency 


® D\C 


4m3/3 


4m3/3 




n/a 


® A\B 


4m3/3 


4m3/3 




n/a 


(D A = A - BD-^C 


m 


m 





® 


(1) D = D- CA-^B 


ra 


ra 





® 


(D A\W 


^ + rn^n 


^ + rn^n 


^ + m^n 


® 


® D\Z 


^ + m^n 


3 r, 

^ + m'^n 


^ + nn?n 


® 


@ -D-^CA-^W 


m?n 


ni^n 


m?n 


®® 


(D -A-^BD-^Z 


rn^n 


rn^n 


m?n 


®® 


(9) A-^X 


m?n 








® 


@ b-^Y 


m?n 








® 


@ -D-^CA-^X 


m^n 








®® 


(g) -A-^BD-^Y 


rn^n 








®@ 


Total 




16™' + Am'^n 


^ + 4m2n 


n/a 



Since the clusters are typically equally split, we can let m = |Sl| = |S_r| = |S|/2 and n = \Bl\ = \Br\ = 
|B|/2. Then, the size of matrices A,B,C, and D is to x m; the size of matrices W,X,Y, and Z is m x n, 
and the size of matrices P, Q, i?, and S* is m x n. For example, for a merge from two (a x a)-clusters to one 
(a X 2a)-cluster, we have m = a and n — 3a. 

Table § gives the cost of computing A(S, S) ^ A(S, B) in 111]). Consider now the multiplication by A(B, S) 
from the left: it requires Arn^n operations. The total computation cost for (11) with the above optimization 
is therefore Ito^^ + Am^n + Amn^. In contrast, the cost without optimization is Smn"^ + Snr^n + |m'^ since 
s — 2m and b — 2n. 

Taking two upward cases as examples, (m, n) — (a, 3a) and (to, n) ~ (2a, 4a), the reduction is: 

296 . 152 „ , 1216 . 640 , 

-> a and a"^ a'^. 

3 3 3 3 

The cost with optimization lies somewhere between 51% to 53% of the original cost. Similarly, for the 
downward pass, in the two typical cases {m,n) = (3a, 2a) and (m,n) = (4a, 3a), we have the reduction 

312a3 ^ 192a3 (62%) and — ^ — a^ (60%). 

3 3 

In the downward pass, \Bl\ but in terms of the computational cost estimate, we can assume that 

they are both s/2. 
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The second method is based on ( |17b[ ), the sequential inverse of the block matrix. Similar to the first 
method, it does not compute A(S,S)^^ explicitly. Instead, it computes A(S,S)^^A(S, B) as 

(A B\^^ (W Q\ _ ( A-~^W + A-^BD'^CA-^W -A-^Bb-^z\ 
\C D) VO - y -D-^CA-^W D-^Z J' 

where D = D — CA^^B, B and C are diagonal, and X — Y — 0. Table [4] shows the required operations 
with the total cost ^m^ + bm?n + Aran? . 



Table 4: Operation costs in flops and their dependence in the sequential inverse method. 





Operation 


Cost 


Dependency 


® 




vn? 


n/a 


@ 


b 





® 


® 




^ + rr?n 


@ 


® 


-{A-^B){b-^Z) 


rrfin 


®® 


® 


A-^W 


m'^n 


® 


® 


-b-^{CA-^w) 


m?n 


®® 


® 


{A-^W) + {A-^B){b-^CA'^W) 


m?n 


®®® 


® 


A(S,B)A(S,S)-iA(S,B) 


4mn? 


® 




Total 


|to^ + 5m^n + Amr? 


n/a 



Compared to the previous method, the cost here is smaller if m > |n. Therefore, this method is better 
than the first method for the downward pass where typically (m, n) is (3a, 2a) and (4a, 2a). In such cases, 
the costs are 174a3 (56%) and (56%), respectively. 



The third method is based on (17c I, the block inverse of the block matrix. The two factors there will 



multiply A(B, S) and A(S, B) separately: 

A(B,S) A(S,S)-i A(S,B) 



p 




fA B' 




") 


[0 b^ 


p 








") 


a' 


PA 


-1 


-PA-^ 



-A-^BD- 

b-^ 



w 



I 

^CA-^ 




Z 





BD-^ 

sb-^ 



w 

-CA-^W 



W 




(25) 



The cost of (251 is -^m? + Am?n + 5mr?, which comes from the multiplication of a block upper triangular 
matrix and a block lower triangular matrix. Table [H] shows the order and details of the operations. Sb~^ 
is computed by first LU factorizing b and then solving for SD^^. The LU form of D will be used when 
computing ~{PA~^B)b~^ as well. 

Finally, our fourth method is more straightforward. We call it naive LU method. It considers A(B,S) 
and A(S, B) as block diagonal matrices but considers AYS, S) as a full matrix without exploiting its sparsity. 



Table J6] shows the order and details of the operations. The 



This method gives cost |m + n + Amn 

cost for computing the second column of L~^A(S, B) is reduced since A(S, B) is block diagonal. 

The summary of all the optimization techniques for sparsity and the corresponding costs is given in the 
main body of this paper, in Table [l] 
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Table 5: Operation costs in flops and their dependence in the block LU inverse method. 





Operation 


Cost 


Dependency 


CT) 




3 


n/a 


(2) 


I) 





CT) 


® 






@ 


® 


-{A-^B){b-^Z) 


rn^n 


®® 




A-^W 


m^n 


® 


® 


-b-^{CA-^W) 


m?"n 


®® 




+ {A-^B){b-^CA-^W) 


m?"n 


®®® 


® 


A(S,B)A(S,S)-iA(S,B) 




@ 




Total 




n/a 



Table 6: Operation cost in flops and their dependence in the naive LU inverse method. 



Operation 


Cost 


LU for A(S, S) 




L^^A(S, B): 1st block column 




L-^A{S, B): 2nd block column 


1 2 


U~^[L-^ A{S,B)] 




A(S,B)A(S,S)'iA(S,B) 


Amn^ 


Total fm^ 4 





Finally, Table [7] shows the estimated level-by- level costs with the physical boundary of the device taken 
into consideration (see Section "Nodes on the physical boundary of the device" ) . This cost does not assume 
any optimization. It simply accounts for the effect of the boundary of the physical device, which leads to a 
computational cost reduction. 

The last two rows for each pass are the sum of the cost of the rest of the small clusters. This estimate does 
not consider the two end blocks corresponding to the device contacts as in real applications [5j[2T]. Accounting 
for the two end blocks, if we do not have any optimization, then the grand total is: 4507V^ -I- ^iV'^ -I- 2iV'^ w 
457A^'^. As a comparison, the cost using RGF with symmetry taken into consideration is S.SA^**. Since the 
cross-point of the two cost curves is at N equal to the ratio between the constant factors in the two methods, 
it is expected to happen at TV ~ 130. This is consistent with the results from our early simulations (between 
N = 100 and N = 150) 0. 

Summary. The optimization for sparsity leads to 40% cost reduction and that for symmetry leads to 50% 
cost reduction. This results in a 70% cost reduction. In other words, the new cost is about 1/3 of the original 
cost. This is reflected in the cross-point of the performance curves using these two methods. Originally, 
the cross-point was around N ~ 130, but now it can be as low as = 40. This shows the performance 
improvement resulting from our optimizations. The optimization for sparsity does not reduce the memory 
cost. The optimization for symmetry does reduce it (approximately by half). 

D Additional proofs 

In this section, we present some additional technical results and proofs required to show correctness of our 
algorithm. All the relevant properties of the mesh node subsets for our algorithm are listed, along with their 
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Table 7: Computation cost estimate for different type of clusters from an iV x mesh, s is the size of the 
private inner node set and b is the size of the boundary set. The size is in unit of and the cost is in unit 
of N^. By convention, the root (the whole mesh) of the cluster tree is level 0. The cost column gives the 
total cost for the given subset of clusters. For example, on row one, under upward pass, we have 2 clusters 
at level 1 with s = 1 and b = 1 and the cost for operating on these clusters is 4.666 A^^^. 

upward pass downward pass 



size per cluster & total cost 


c 


lusters 


size 


& cost 


per cluster 


clusters 


s 


b 


cost 


level 


number 


s 


b 


cost 


level 


number 


1 


1 


4.666 


1 


2 





1 





1 


2 


1 


1 


9.332 


2 


4 


1 


1 


9.332 


2 


4 


1/2 


3/4 


2.040 


3 


4 


3/2 


3/4 


14.624 


3 


4 


1/2 


5/4 


4.540 


3 


4 


1/2 


5/4 


4.540 


3 


4 


1/2 


1 


3.168 


4 


4 


1 


1/2 


4.332 


4 


4 


1/2 


3/4 


4.080 


4 


8 


1/2 


3/4 


2.040 


4 


4 


1/2 


1/2 


1.168 


4 


4 


3/2 


3/4 


14.624 


4 


4 


1/4 


3/8 


0.2552 


5 


4 


1 


1 


9.332 


4 


4 


1/4 


1/2 


0.3958 


5 


4 


1 


3/4 


52.672 


5 


32 


1/4 


5/8 


1.703 


5 


12 


3/4 


1/2 


38.976 


6 


64 


1/4 


3/4 


2.312 


5 


12 






<91.648 


7... 




1/4 


1/2 


6.336 


6 


64 












1/8 


3/8 


3.072 


7 


128 
















<9.408 


8... 














Subtotal 


52.470 






Subtotal 


242.120 






Total: 


294.6 

















proofs for the basic FIND algorithm in Li et al. [S] . We will then use these properties to prove two theorems 
and four corollaries used for computing G^. Some of the properties are not directly used in the proofs but 
are still listed to help understand the algorithm. 

Property [3] is fairly simple but will be used again and again in proving theorems, corollaries, and other 
properties. 

Property 3 By definition ofT^, one of the following relations must hold: Ci C Cj, D Cj , orCiOCj — 0. 

Property |4] shows an alternative way of looking at the inner mesh nodes. It helps understand Property[5] 
Property 4 1,^ — UcjCc,Sj, where Ci and Cj e T+. 

Property 5 // Cj and Cj are the two children of Cj,, then U B^, = U Bj and Sk = (B^ U Bj)\Bfc. 

Property 6 For any given augmented tree T+ and all Ci G T+, Si, B_r, and Cr are all disjoint and 
M = (Uc_gT+S,) U B_,. U C^. 

Property [6] shows that the whole mesh can be partitioned into subsets Si, B_r, and C^. See Sections 
"Brief description of the FIND algorithm" and "Formal description of the FIND algorithm" for notations 
and figures. 

Below we list properties of Si for specific orderings. 
Property 7 If Si < Sj, then Cj (jt. Ci, which implies either Ci C Cj or Ci n Cj — 0. 
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This property is straightforward from the definition of S,; < Sj and Property [3[ 

Properties |8] and [9] below are related to the elimination process. They were used in the proof of the 
following theorem (see main body of paper, theorem [ij: 

Theorem 2 For any target clusters r and s such that Cg G T+ and Cg e T+, we have 

Ar,g+(Bg, Bg) = As^g+(Bg, Eg). 

They will also be used in the proof of Theorem [3] and Theorem |4] 

Property 8 For any k,u such that C^, C„ G T+, if ^ C^, then B„ H Ifc = 0. 

Proof: By Property [Sj we have Cu ^Ck. So for all j such that Cj C , we have j 7^ m and thus Sj n S'^ = 
by property 3. By property 2, 1^ = ^Cj<zCkSj, so we have Su = 9- □ 

Property 9 // Cj is a child of Ck, then for any C„ such that Sj < S„ < Sfc, we have Cj n = and thus 
Bj n B„ = 0. 

This is because the can be neither a descendant of Cj nor an ancestor of C^. 

In Theorem [3] and its proof below, we follow the same notation used in Li et al. [F. In particular, we 
write Sg more formally as S^^. , with j ~ 1,2, . . ., etc., where A^^ = A, S^^ — S, and S^^. < S^^, iff j < j' . 
We also denote Si-^-^ as Sg+ when g = ij. Because of Property [61 the whole mesh M is partitioned into a 
sequence of sets Si-, B_r, and C^, j — 1,2, . . ., and the order of^. and S is determined by this sequence. 
For notation convenience, if i is the index of the last set in the sequence of , which is always — r for T+, 
then Si_|_ stands for B_r. 

Theorem 3 // we perform Gaussian elimination as described in the Section "Formal description of the 
FIND algorithm" and update Sg accordingly based on Sg_|_ ~ L^^^SgL^"^, then we have: 

(I) ySh > Sg,Sg(S^,S>ABh) = Sg(S>,AB,,,S,0 - 0; 

(II) (a) Sg+(S<g,S<g) = Sg(S<g,S<g); 

(b) Sg+(M,S>g\Bg)=Sg(M,S>g\Bg); 
(C) Sg+(S>g\Bg,M) = Sg(S>g\Bg,M); 

(III) Sg+(Bg, Bg) = Sg(Bg, Bg) ~ C gS g {S g , Bg) - Sg(Bg, Sgj^ + £gSg(Sg, Sg)^ . 

In other word, this theorem describes how the multiplications by Lg ^ and L" ^ affect Sg. Specifically, 
these multiplications only affect Sg(Bg, Bg), which is given by (III), and Sg(S<g, Bg) and Sg(Bg, S<g), which 
are not given here but can be ignored. The multiplication preserves the sparsity pattern specified by (I) 
above, which is illustrated by the following submatrix for any S/,. > SgJ^ 

Sh B/i S>/i\B?i 
S/i X X 

B;j X X X 

S>h\Bh Ox X 

Proof: Recall that —Cg — — Lg(Bg,Sg) is the only non-zero off-diagonal block of Lg^- As a result, all 
entries in Sg(»,S<g) and Sg(S<g,») are irrelevant to the multiplication and (I) implies (II) and (III). So it 
suffices to prove (I) by mathematical induction. 

For g = ii, (I) holds because no nodes in S;i and S>;i\B^ are connected to each other and in the original 
matrix S, an entry is non-zero iff the two nodes associated with the column and the array of the entry 
connect to each other. If (I) holds for g — k, then by Property [3] we have either Cfc C Ck+ or n Ck+ = 0. 
So, 

^Similar to the notation Ag(», Bg), the entries of the blocks in this submatrix usually do not stay together (except the 
entries of Sg(Sji,Sfi), which always stay together). We write them here like blocks to better illustrate the sparsity pattern of 

Sg. 
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• if Cfc C Ck+, consider u such that Sfc+ < S„. By Property [sj we have \k+ H S„ = 0. Since Ck+ = 
lfc+UBfe+, wehave (S„\Bfc+)nCfe+ =0. So we have Bfcn(S„\Bfc+) C (S„\Bfe+)nCfc C (S„\Bfe+)nCfe+ = 
0^ Bfcn(S>fe+\Bfc+) = 0; 

• if Cfe n Cfc+ = 0, then B^ C Cfc and Sfe+ C Ck+ =^ Bfc n Sk+ = 0. 

So in both cases, we have (Bfc, Bfc) n (Sfe+, S>fc+\Bi:+) = 0. In addition, by Property [6] we have SfenS/c+ = 0. 
Since all (I), (II), and (III) hold for g ^ k, wc have Sfc+(Sfc+, S>fe+\Bfc+) = Sfc+(S>fc+\Bfc+, Sfe+) = 0. 

Since the above argument is vahd for all h > k+, we have Vft. > k+, Sfc+(S/j, S>/,.\B/,,) = Sfc_|_(S>?i\B;j, S^) = 
0, i.e., (I) holds for g = k+ as well. By induction, we have that (I) holds for all g such that Cg e T+. □ 

Corollary 1 // and Cj are the two children of C^, then Sj,(Bj,Bj) = S(Bj,Bj) and Sfc(Bj, B^) = 
S(B„B,). 

1 J. 

In words, the corollary states that when we compute SfcL^. ' , we can take some of the entries in 
directly from the original S. 

Proof: By Theorem [sj the multiplications by and L"''' only change the entries of S„ in (B„,B„), 
(S<,j, B„), and (B„, S<„). So it suffices to show that for every S„ < Sfc, we have 

(B„,B„)n(B„Bj) = 0, (26) 
(S<„, B„) n (B„ Bj) = (B„, S<„) n (B„ B,) = 0, (27) 

and similar identities with i and j interchanged. 
To show (26 1, consider the following three cases: 

• c„ n Cfe = ^ c„ n c, = c„ n c^- = ^ b„ n b, = b„ n b^- = 0; 

• c„ c c, c„ n Cj = ^ B„ n b^- = 0; 

• c„ c ^ c„ n Q = ^ B„ n B, = 0. 

Because WSu < Sfe, by Property [t] (with some extension), one of the above cases must hold. So we have 
proved (26). 

To show (27 1, consider S^, < Sfe. Because 

S„ < Sfe => Cfe (^Cy^ Q (/L C„ 

by Property [7| we have either 

c„ n c, = ^ s„ n B, 

or 

C„ C Ci (by Property m S„ C 1^ =^ Sy H B^ 0. 
We can also conclude S^ n Bj = by a similar argument, so we have proved (27). □ 

Corollary 2 // C^ is a child o/Cfe, then Sfe(Bi,Bi) = S,i4_(B,;, B^). 

Proof: Consider u such that Si < S„ < Sfe. By Property [Oj we have B^ n B„ = 0. By Theorem [Sj the 
multiplications by L^^ and Li^^ only change the entries of S„ in (B„,B„), (S<u, B„), and (B„,S<„) and 
none of them will change the entries in (B^, B^), so the corollary is proved. □ 

Corollary 3 If Ci is a leaf node in T+, then Si(Ci,Cj) = S(Ci,Ci). 

Proof: Consider S„ < 5^. By Propertyjs] either C„ C C^ or C„ n C^ = 0. Since Ci is a leaf node, there is no 
u such that C„ C C,;. So we have C„ n C^ = and thus B^ n C^ = 0. By theorem [s] the corollary is proved. 

□ 

Theorem 4 For any r and s such that Ci G T+ a nd C,; G T+, we have: 

S^,,(S, U B„ S, U B,) = ,,(S, U B,, S, U B,) (28) 
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Proof: If Ci is a leaf node, then by CoroUarylsl we have Sr,i(Si U B^, St U B^) ~ Sr,i(Ci, Cj) = Sr(C,;, C^) = 
Ss(Ci, Ci) — T,s^i{Ci,Ci) — Ss_i(Si U Bi, Si U B7) 

If (28) holds for i and j such that Ci and Cj are the two children of Ck, then 

• by Theorem [Sj we have 

o Sr,i+(B^,B^) = Ss,i+(Bi,Bi) and 
o S,.j+(Bj-, Bj) = Ssj+(Bj, Bj); 

• by Corollary [l] we have 

o S,,fc(B„ Bj) = S,(B„ Bj) = S,(B„ B^) = S,,fe(B„ Bj) and 
o S,.^fc(Bj, Bi) = Sr(Bj, Bi) = Ss(Bj, Bi) = S<;^fc(Bj, Bi). 

• by Corollary [2j we have 

o Sr^fe(Bi, Bi) = Sr,i+(Bi, Bi) — Ss^i+(Bi, Bi) — Ss^fe(Bi, Bi) and 
o Sr,fc(Bj, Bj) ~ S.r.j+(Bj, Bj) = Ssj-+(Bj, Bj) = Ss_fe(Bj, Bj); 
NowwehaveSr,fe(BiUBj,BjUBj) = ^^^^(BjUBj, BiUBj). By Property[5] we have Sr,fc(SfcUBfc, S^UBfe) 
'^s,k{Sk U Bfc,Sfc U Bfe). By induction, the theorem is proved. □ 

Corollary 4 For any r and s such that Ci G T+ and Ci G T^, we have: 

Sr,i+(Bi, Bi) = Ss^i+(Bi, Bi) 

This corollary shows that the update processes for two target clusters C^ and Cs share their partial results 
so we can reuse them in our algorithm. The proof is trivial given Theorem [3] and Theorem [4j 



References 

[1] S. Datta, Electronic Transport in Mesoscopic Systems, Cambridge University Press, 1997. [s] 

[2] A. Svizhenko, M. P. Anantram, T. Govindan, B. Biegel, R. Venugopal, Two-dimensional quantum 
mechanical modeling of nanotransistors. Journal of Applied Physics 91 (4) (2002) 2343-2354. |3] 

[3] A. Ghosh, P. Damle, S. Datta, A. Nitzan, Molecular Electronics, Materials Research Society bulletin 
(2004) 391. |3] 

[4] A. Martinez, M. Bescond, J. Barker, A. Svizhenko, M. Anantram, C. Millar, A. Asenov, A self-consistent 
full 3-D real-space NEGF simulator for studying nonperturbative effects in nano-MOSFETs, IEEE 
Transactions on Electron Devices 54 (9) (2007) 2213-2222. [3] 

[5] S. Li, S. Ahmed, G. Klimeck, E. Darve, Computing entries of the inverse of a sparse matrix using the 
FIND algorithm. Journal of Computational Physics 227 (2008) 9408-9427. [sj [i) [t) [s) [Toj [l3) [2l] [26) [27) 

m 

[6] E. Darve, S. Li, Y. Teslyar, Calculating transport properties of nanodevices, in: Proceedings of SPIE, 
Philadelphia, PA, Vol. 5593, 2004, pp. 452-463. [3) [4) 

[7] S. Li, S. Ahmed, E. Darve, Fast inverse using nested dissection for NEGF, Journal of Computational 
Electronics 6 (2007) 187-190. [s) ^ 

[8] R. Lake, G. Klimeck, R. Bowen, D. Jovanovic, Single and multiband modeling of quantum electron 
transport through layered semiconductor devices. Journal of Applied Physics 81 (1997) 7845. [s) 

[9] S. Borm, L. Grasedyck, W. Hackbusch, Hierarchical matrices, Lecture note, Citeseer 21 (2003) 1-124. 
[3)i 

[10] W. Hackbusch, A sparse matrix arithmetic based on "H-matrices. Part I: Introduction to "H-matrices, 
Computing 62 (2) (1999) 89-108. [s) 



30 



[11] S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, X. Sun, A. van der Veen, D. White, Some fast algorithms 
for sequentially semiseparable representations, Siam Journal on Matrix Analysis and Applications 27 (2) 
(2006) 341-364. ^ 

[12] S. Chandrasekaran, M. Gu, T. Pals, A fast ULV decomposition solver for hierarchically semiseparable 
representations, SIAM Journal on Matrix Analysis and Applications 28 (2006) 603-622. |4] 

[13] S. Chandrasekaran, P. Dewilde, M. Gu, W. Lyons, T. Pals, A fast solver for HSS representations via 
sparse matrices, Siam Journal on Matrix Analysis and Applications 29 (1) (2008) 67-81. [4] 

[14] D. Mamaluy, M. Sabathil, P. Vogl, Efficient method for the calculation of ballistic quantum transport. 
Journal of Apphed Physics 93 (8) (2003) 4628-33. [I] 

[15] H. Khan, D. Mamaluy, D. Vasileska, Quantum transport simulation of experimentally fabricated nano- 
FinFET, IEEE Transactions on Electron Devices 54 (4) (2007) 784. [I] 

[16] M. Sabathil, D. Mamaluy, P. Vogl, Prediction of a realistic quantum logic gate using the contact block 
reduction method, Semiconductor Science and Technology 19 (2004) S137-S138. |4] 

[17] D. Mamaluy, A. Mannargudi, D. Vasileska, M. Sabathil, P. Vogl, Contact block reduction method and its 
apphcation to a 10 nm MOSFET device, Semiconductor Science and Technology 19 (2004) S118-S121. 

SI 

[18] D. Mamaluy, A. Mannargudi, D. Vasileska, Electron Density Calculation Using the Contact Block 
Reduction Method, Journal of Computational Electronics 3 (1) (2004) 45-50. [2] 

[19] S. Birner, C. Schindler, P. Greek, M. Sabathil, P. Vogl, Ballistic quantum transport using the contact 
block reduction (CBR) method. Journal of Computational Electronics 8 (3) (2009) 267-286. [I] 

[20] J. Tang, Y. Saad, A probing method for computing the diagonal of the matrix inverse. Tech. rep.. Tech. 
Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 
2009 (2010). |4] 

[21] A. Svizhenko, M. P. Anantram, T. R. Govindan, B. Biegel, Two-dimensional quantum mechanical 
modeling of nanotransistors. Journal of Applied Physics 91 (4) (2002) 2343-54. |4) [s) [l3) [l6) [26] 

[22] A. Erisman, W. Tinney, On computing certain elements of the inverse of a sparse matrix. Communica- 
tions of the ACM 18 (3) (1975) 177-179. [I] 

[23] D. Petersen, H. S0rensen, P. Hansen, S. Skelboe, K. Stokbro, Block tridiagonal matrix inversion and 
fast transmission calculations, J. Comput. Phys. 227 (2008) 3174-3190. [I] 

[24] L. Lin, J. Lu, L. Ying, R. Car, Fast algorithm for extracting the diagonal of the inverse matrix with 
application to the electronic structure analysis of metallic systems. Communications in Mathematical 
Sciences 7 (3) (2009) 755-777. [I] 

[25] L. Lin, C. Yang, J. Meza, J. Lu, L. Ying, Sellnv — An Algorithm for Selected Inversion of a Sparse 
Symmetric Matrix, ACM Transactions on Mathematical Software (TOMS) 37 (4) (2011) 40. ^ 

[26] A. George, Nested dissection of a regular finite-element mesh, SIAM Journal on Numerical Analysis 
10 (2) (1973) 345-63. |4) [8] 

[27] J. Liu, The solution of mesh equations on a parallel computer, Dep. of Computer Science, Univ., 1974. 
II 

[28] A. George, E. Ng, On row and column orderings for sparse least squares problems, SIAM Journal on 
Numerical Analysis 20 (2) (1983) 326-344. ^ 



31 



J. Gilbert, E. Zmijewski, A parallel graph partitioning algorithm for a message-passing multiprocessor, 
International Journal of Parallel Programming 16 (6) (1987) 427-449. [I] 

A. Pothen, H. Simon, K. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM Journal 
on Matrix Analysis and Applications 11 (3). |4] 

M. Heath, E. Ng, B. Peyton, Parallel algorithms for sparse linear systems, SIAM review 33 (3) (1991) 
420-460. H 

D. Petersen, S. Li, K. Stokbro, H. S0rensen, P. Hansen, S. Skelboe, E. Darve, A hybrid method for the 
parallel computation of Green's functions. Journal of Computational Physics 228 (14) (2009) 5020-5039. 

E 

G. H. Golub, C. F. Van Loan, Matrix Computations, Third Edition, Johns Hopkins University Press, 
1996. [T§ [52] 

J. R. Bunch, B. N. Parlett, Direct methods for solving symmetric indefinite systems of linear equations. 



SIAM Journal on Numerical Analysis 8 (1971) 639-55. 16 



J. R. Bunch, L. Kaufman, Some stable methods for calculating inertia and solving symmetric linear 



systems. Mathematics of Computation 31 (1977) 162-79. 16 



U. Trottenberg, T. Glees, Multigrid software for industrial applications — from MGOO to SAMG, 100 



Volumes of Notes on Numerical Fluid Mechanics (2009) 423-436. 21 



A. Brandt, General highly accurate algebraic coarsening. Electronic Transactions on Numerical Analysis 
10 (1).[21] 

J. Fattebert, J. Bernholc, Towards grid-based 0{N) density-functional theory methods: Optimized 



nonorthogonal orbitals and multigrid acceleration. Physical Review B 62 (3) (2000) 1713. 21 



J. Bernholc, M. Hodak, W. Lu, Recent developments and applications of the real-space multigrid 



method. Journal of Physics: Condensed Matter 20 (2008) 294205. 21 



G. Feng, T. Beck, Nonlinear multigrid eigenvalue solver utilizing nonorthogonal localized orbitals, Phys- 



ica Status Solidi (b) 243 (5) (2006) 1054-1062. 21 



N. Wijesekera, G. Feng, T. Beck, EfRcient multiscale algorithms for solution of self-consistent eigenvalue 



problems in real space. Physical Review B 75 (11) (2007) 115101. 21 



32 



